AI models can now solve complex programming tasks in hours while still failing at simple everyday questions. According to Andrej Karpathy, that is not a contradiction but a reflection of how progress in AI is uneven across domains.
Karpathy says there are currently two very different perspectives on AI progress. One group has tried the free version of ChatGPT or the voice mode and formed an opinion based on obvious mistakes, weak reasoning, and hallucinations. In his view, however, those older or less capable models no longer reflect the current frontier.
The second group uses the latest professional-grade systems such as OpenAI Codex or Claude Code in technical domains like programming, mathematics, and research. There, Karpathy argues, progress this year has been dramatic: these models can independently refactor entire codebases or identify security vulnerabilities. As a result, the two groups are often talking past each other.
“It is simultaneously true that OpenAI’s free and, in my opinion, somewhat neglected Advanced Voice Mode fails at the dumbest questions in Instagram Reels, while at the same time OpenAI’s most expensive paid Codex model can spend an hour coherently restructuring an entire codebase or finding and exploiting vulnerabilities in computer systems,” Karpathy wrote on X.
Behind that observation is a deeper point: fields such as coding and mathematics, where outcomes can be clearly verified and reinforced through feedback, are currently benefiting far more from AI progress than areas without clean evaluation metrics, such as writing, consulting, or open-ended advice.
Verifiability as the key to progress
Karpathy’s argument touches on one of the central questions in AI research today: can language models develop into a more general intelligence, or can they only be optimized to perform efficiently in specific domains with well-defined feedback loops?
He addressed this structural issue in an earlier essay on what he called the “Software 2.0” paradigm. In that framework, the critical factor is not whether a task can be precisely specified, but whether it can be verified. Only when a system can receive automated feedback, such as right-or-wrong judgments or clear reward signals, can it be effectively improved through reinforcement learning. As Karpathy put it, “the more verifiable a task is, the better it can be automated in this new programming paradigm.”
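The idea can be made concrete with a toy sketch. Below, a model-generated solution is scored by an automated verifier that returns a binary right-or-wrong signal, the kind of clean reward that reinforcement learning can optimize against. The function names and checks are illustrative, not taken from any actual RL framework.

```python
def candidate_sort(xs):
    """Stand-in for a model-generated solution we want to verify."""
    return sorted(xs)

def verify(solution):
    """Run automated checks and return a scalar reward.

    Returns 1.0 if the solution passes every check, else 0.0. This
    binary judgment is the 'verifiable feedback' that makes domains
    like coding and math amenable to reinforcement learning.
    """
    checks = [
        solution([3, 1, 2]) == [1, 2, 3],
        solution([]) == [],
        solution([5, 5, 1]) == [1, 5, 5],
    ]
    return 1.0 if all(checks) else 0.0

reward = verify(candidate_sort)  # a correct solution earns reward 1.0
```

By contrast, tasks like open-ended writing or consulting admit no `verify` function of this kind, which is exactly why, on Karpathy's account, they benefit less from current training methods.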
Last summer, rumors circulated about a possible “Universal Verifier” at OpenAI that could extend reinforcement learning across all areas of knowledge. So far, however, nothing concrete has emerged. Meanwhile, Jerry Tworek, one of the leading figures behind OpenAI’s reinforcement learning strategy, has left the company and recently said on X that “deep learning research is essentially complete.”